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proposed in the paper is for the compression of digital images. It is aIrOpHIM apH(pMeTH4ecKOrO KOMpOBaHHA MpeqHasHayeH Ayla 
shown how the complexity of the arithmetic coder algorithm de- OxKaTHA UMdposbix w30paxKeHHit. TloKasaHo, KakHM 06pa30M 
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ered). In the course of work, the most computationally complex Mipotrecce paGorst onpeneseHbI HaHGosIee BEIIMCTUTEBHO CHOI 


parts of the arithmetic coder algorithm are determined. Perfor- HbIe 4“acTH aJIroput™Ma apudMerHyeckoro koyepa. Bpmoymena 
mance optimization of their software implementation is carried ONTHMH3alHaA TMpPOW3BOUMTeIbHOCTH MX TIporpaMMHOH peasiu3a- 
out. Codecs with the new algorithm compress photo and video HM 


: : : Koyeku C HOBBIM aJIrOpHTMOM C)KHMarOoT ‘a MOKK BOW 
records obtained through the remote probing of water bodies Ce ee SLIT BEG SEMIN LSE FO 


‘ : pa3HHIbI (pOTO- H BUeCOMaTepHalbl, NOMyYCHHbIe pH WMcTaHiyH- 
without frame-to-frame difference. 
OHHOM 30HJMPOBaHHH BOJHBIX OOBCKTOB. 

Mamepuaaei u memoooi. B npeyctapieHHoli Hay4HoH paOoTe vc- 
satellite images of the Azov Sea area was used. At this, the soft- TIOJIb30BaHa TloqOopKa CITYTHHKOBBIX CHHMKOB aKBaTOpHi A30B- 
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codecs were made to measure the runtime. Comparison of the BaHHOTO KoeKoB. CpaBHeHHe cpeqHero BpeMeHH HX HCIIOuHeHHA 
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average time of their execution showed that the modified codec 
performance is 5.21% higher. The overall performance improve- 
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Introduction. Monitoring of the condition of the water body is often carried out through unmanned aerial vehi- 
cles (UAV), conducting aerial photography in the visible and infrared bands. The performance capabilities of the mo- 
bile camera, which the UAV has, impose a number of restrictions on the equipment that processes and stores the foot- 
age until the UAV returns. Specifically, the following factors should be considered. 

1. Energy efficiency of the equipment that encodes the footage since it depends on the off-line operation time of 
the UAV. 

2. Coding gain of photo and video data during the flight. High resolution images occupy a significant amount of 
the software memory, and this limits the amount of information that a UAV can accumulate. 

Virtually all equipment for photo and video hardware supports the most common codec for compressing JPEG 
images. However, it is inferior to the most modern HEVC and VP9 codecs, which not only support video sequences, but 
also allow for better compression of single images. Thus, GoogleVP9 shows similar JPEG visual quality by the SSIM 
metric (structural similarity) and at the same time compresses images by 25—34% stronger [1]. 

Compared to the same JPEG, HEVC codec enables to improve the compression rate by 10-44% in PSNR metric 
(peak signal-to-noise ratio) [2]. However, a high compression ratio with a smaller bitstream size results in greater com- 


putational complexity of the HEVC and VP9 codecs [3, 4]. The architecture of the JPEG codec is presented in general 
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Fig. 1. JPEG codec flow chart 


Here, VLC means compression through variable-length codes (variable length coding). The input frame is divid- 
ed into blocks of fixed size (8x8). Each of them is subjected to direct discrete cosine transform (DCT), quantization of 
transform coefficients and subsequent entropic compression using the Huffman algorithm [4]. Discrete cosine transform 
is performed in the integer form [5]. Since the adoption of the JPEG standard in 1992, a number of fast algorithms have 
been developed. They allow the conversion to be carried out entirely in the CPU registers. Huffman’s entropy compres- 
sion is also not a complicated problem, so, even mobile processors in program mode can compress and decode JPEG 
images [6]. 

HEVC [6] and VP9 are fundamentally similar, and they are hybrid block codecs with splitting a frame into 
blocks of indeterminate length, intraframe prediction, discrete transform, and subsequent filtering to eliminate blocking 


artifacts. The blockdiagram of the HEVC codec is shown in Fig. 2. 
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Fig. 2. HEVC codec flowchart 


Here, the abbreviation SAO denotes the sample adaptive offset. In addition to the above image reconstruction al- 
gorithms, both codecs use context-adaptive binary arithmetic entropy coding, which is much more complicated than 
Huffman compression. With a high level of visual quality, it is arithmetic coding that occupies a significant part of the 


total decoder operation time. In general, the flowchart of the arithmetic coder is shown in Fig. 3. 
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Fig. 3. Arithmetic coder flowchart 


Codecs of a new generation are supposed to be used, among other things, for compressing images obtained 
through remote sensing of natural objects. In this case, it is necessary to optimize the operation of hybrid block codecs 
to ensure the processing of photo and video UAV data in real time. It is necessary to put more focus on the optimization 
of binary context-adaptive arithmetic coding, due to its essential complexity. 

The objective of this study is to speed up the operation of the arithmetic decoder on mobile processors of the 
ARM architecture. This will improve the performance of Google VP9 video codecs and will allow for their application 
in the UAV on-board electronics for remote sensing of water bodies. The use of enhanced tools for compressing photos 
and videos obtained from aerial photography enables to increase the amount of stored data, improve their visual quality 
and resolution. 

Main Part. Core components of a modern video codec include the binarization of syntactic elements of the 
bitstream and the adaptive binary encoding of these elements into the bitstream. This stage is not safe for the 
vectorization and parallelization, but it can be optimized through the statistical analysis of the input data. The main way 
to predict the running time of a program is to analyze the complexity of the corresponding algorithm. Distinction is 
provided between complexity at the best, worst and average case. x is input data of A algorithm which is used to 
calculate the output of y algorithm. The time function of the algorithm is denoted by CJ (x), and the memory cost 
function — by Cj (x). In the worst case, we will call T,(n) = max),))=n Ci (x) functions of numeric argument the time 
and space complexity A 

Sa(n) = maxyxy=n Ca (x). 

Consider a finite set of n size inputs: 

X, = {x:|lxll = n}. 
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vx € X, corresponds to the probability: 
Pr(x) € [0,1]: Yxex, P(X) = 1. 

The following expectation is called the average-case complexity: 

T, = LixeXn P, (x) Cf (x), 
S4 = eee HCG, ©). 

The described approach is classical for analyzing average-case complexity of the algorithm, and it is described in 
detail in [7, 8]. Note the reasons why the application of this method in practice may be difficult or impractical. 

1. Difference between the number of steps of the algorithm in theory and the number of processor cycles re- 
quired to perform the step in practice. Thus, the majority of modern central processors produce addition and multiplica- 
tion per cycle, while the remainder of the division is calculated for dozens of cycles. 

2. Hardware features of the memory system. Modern computers use a multi-level memory hierarchy. Its compo- 
nents operate at different speeds. Memory calls take considerably longer than register operations. 

3. Optimizing compilers and hardware planners. When building executable files, optimizing compilers transform 
dramatically the code without changing the program state engine. CPU hardware schedulers change the procedure of 
executing instructions for greater performance and predict conditional transitions, while cache controllers read from 
memory in blocks. 

4. In case of the software algorithms implementation on the general-purpose processors, software components 
interfere. For example, the task scheduler shares processor time, and parallel processes that have multiple threads can 
run on a variable number of processor cores. 

The proposed modification of the algorithm complexity serves as a theoretical addition to practical tools for 
measuring performance, such as, for example, profiling and instrumentation of the program code. For the first time, the 
method of splitting the algorithm inputs into complexity classes was presented in [9]. Consider A algorithm and a set of 
all possible inputs: 

G:{91, 92-4, 
and also all possible samples from G, different in size and composition: 
9:{91, Of 3. 
A set of criteria for the complexity measure of the algorithm implementation (for example, the number of processor 
cycles, run-time, etc.): 
«9g; > R. 
A set of complexity measures: 
A: {o1, > ... f. 
This set has the following properties. 
1. W«,,«,€ A: x, #0, — all A elements are different. 
2. W«;€ A splits G into the set of equivalence complexity classes 
G(a;) = {9} N gP...}: 
3. All samples from G(a;) are equally complex: 
aa; € A, gi € G(a), 4: 9f > Hk € [0, I1G(a,)JII]L7 € R. 

Thus, all elements of A are different, and they can be reordered so that the complexity function of the criterion is 
nondecreasing throughout the whole set of criteria. The expected complexity is similar to the estimates of the average- 
case complexity of the algorithm for the discrete and continuous probability of complexity. For the discrete case: 

R(A) = ajeA T; Pi, 
for the continuous case: 
R(A) = Jf, rdF(r). 

The method in question is applicable for analyzing the complexity of an arithmetic codec. The process of entro- 
py compression [10, 11] can be divided into component parts. 

1. Binarization or conversion of the coded character (syntactic element of the compressed bitstream) into a string 
made up of zeros and ones (bit string). 

2. Context modeling for compressing syntactic elements in the normal mode. This step is not performed for syn- 
tactic elements whose statistical distribution is close to normal, and they are coded in the bypass mode. 

3. Arithmetic coding of a bit string. 

Consider in more detail the arithmetic decoding scheme of the Google VP9 codec, namely the part that is associ- 
ated with the subexponential coding of syntactic elements. 
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Imagine an algorithm of the subexponential coding in general [12]. The first step includes the calculation of the 
variables: 
his { kin < 2k 
llog, nJ:n = 2*, 
eo { On <2" 
b-k+1:n2 2k, 
where k is parametrical value, it is 4 for the Google VP9 codec. 

In the second step, the unary code u(u + 1) bit is complemented by n low-order bits. The code length is equal 
to: 

k+1:n< 2* 
2[log, n] —k+2:n > 2*. 

Hence, literal decoding is reduced to decoding the bits that compose it in the loop. To optimize the performance 
of this algorithm, it is important to know the probability distribution of literal lengths. Literals occupying the largest 
number of bits in a compressed bitstream (such as inverse transform coefficients and motion vectors) are coded in se- 
ries, so there is a high probability that the distribution of literal lengths in a compressed bitstream will be constant with 
multiple repeats of elements of the same value. To test this hypothesis, experimental data on the distribution of literal 
lengths in a set of satellite images of the Sea of Azov is acquired (Table 1). 


utitn=| 


Table 1 


Literal length, bit 1 2 3 4 5 6 
Probability, % 0.94 0 67.35 18.25 0 13.46 



































The literals of 3, 4, and 6 bits in length are most probable. The maximum possible literal length for this sequence 
is just 6 bits. This fact is essential for software optimization of the function of subexponential decoding of a literal. As 
part of optimizing a real codec, the run-time criterion is of prime interest. To obtain the set of difficulties: {79,...,7%4}, 
we will profile the program performance. 

A set of unique elements (R) will make up a set of criteria of the run-time complexity (A ). 

The following approaches that are based on the data obtained are applied for the optimization. 

1. Memoization of calculating the literal length to decode a series of literals of equal length. 

2. Unwinding of a cycle of subexponential decoding of a literal. 

3. More efficient algorithm for calculating the number of bits in a literal. 

4. More efficient use of the processor registers immediately within the arithmetic decoding function. 

The implementation of points | and 4 is fairly obvious; therefore, we consider in more detail points 2 and 3. 
Compiling of a decoded literal by bits decoded from a compressed bitstream occurs within the subexponential decoding 
function. In this case, the bottleneck is a loop with varying number of iterations. It can be replaced with a switch-case 
set without break at the end. This technique is known as the Duff’s device method. It allows replacing several loop iter- 
ations through sequentially execution of the instructions without the need for conditional transitions. The bit shift 
amount is a constant that does not need to be read from the register - loop counter. 





Code Listing 1: The original literal decoding function 
static int vp9_read_literal(vp9_reader *br, int bits) 
{ 
int z = 0, bit; 
for (bit = bits — 1; bit >= 0; bit —) 
z |= vp9_read_bit(br) << bit; 
return Z; 


} 














Code Listing 2: Modified literal decoding function 
static int vp9_read_literal(vp9_reader *br, int bits) { 
register int z = 0; 
switch(bits -1){ 

case 6: z |= vp9_read(br, 128) << 6; 

case 5: z |= vp9_read(br, 128) << 5; 

case 4: z |= vp9_read(br, 128) << 4; 
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case 3: z |= vp9_read(br, 128) << 3; 
case 2: z |= vp9_read(br, 128) << 2; 
case 1: z |= vp9_read(br, 128) << 1; 
case 0: z |= vp9_read(br, 128); 
break; 
} 


return Z; 


} 











Another bottleneck is the calculation of the number of literal bits in the while loop [13]. This is a worse solution 
because the number of loop iterations is unpredictable. Instead, a fast bit-counting algorithm was used [14, 15], which 
performs the calculation for a fixed number of steps without conditional transitions. 








Code Listing 3: Fast counting of the number of bits in a literal 
Unsig ned intv; // 32-bit argument 

Register unsig ned intr; // variable for the number of bits 
register unsigned int shift; 

rt = (v > OxFFFF) << 4; 

V>>=F; 

shift = (v > OxFF) << 3; 

v >>= shift; 

r |= shift; 

shift = (v > OxF) << 2; 

v >>= shift; 

r = shift; 

shift = (v > 0x3) << 1; 

v >>= shift; 

r = shift; 

tPive> 1; 














Numerous starts of the reference and modified codecs were made to measure the runtime. In this case, their av- 
erage run-time was compared. It has been found that the performance of the modified codec is 5.21% higher. The in- 
crease in overall performance for arithmetic decoding was 7.33%. 

Conclusions. The operation of the arithmetic coder as a component of the video codec has been optimized 
using the example of the Google VP9 standard. To solve this problem, a modification of the method for analyzing the 
average-case complexity of algorithm has been proposed. The approach is based on the partitioning of the set of inputs 
into equivalence complexity classes. The considered method enables to predict the average-case complexity of the 
algorithm when the number of steps of the algorithm and the time of its execution depend on difficult-to-measure 
parameters, which is typical of the context-adaptive arithmetic coding. The proposed method has been applied to 
optimize the speed of the arithmetic binary coder (using the example of the Google VP9 codec) for the image 
compression problems obtained under remote sensing of water bodies. The research results make it possible to apply 
advanced methods of compressing photo and video data obtained through the aerial photography of water bodies. Thus, 
it is possible to increase an amount of the accumulated data, to improve the visual quality and resolution of the footage 
by 25-34% (according to the SSIM visual quality metric) and to increase the speed of the arithmetic coder by 7%. 
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